CPSC 545/445 (Autumn 2003) - Class 16: RNA and Protein Structure continued
Module 5: RNA and Protein Structure - Part 3
[lecture by Alena Shmygelska]

---
5.8 Protein Structure and Functions 

Basic protein chemistry
	- structure of amino acid
	- peptide bond
        
Functions of proteins and motivations for studying them:
        - structural role
        - enzymatic activity
        - energy transduction
        - protective role
        - transport
        - etc.

Protein structure:

Four levels of protein structure:
	1. primary structure  - amino acid sequence
	2. secondary structure - alpha helices, beta sheets, turns, coil (local interactions due to hydrogen bonding between NH and CO groups of the backbone)
           (super-secondary structure - recurrent patterns of secondary structure, e.g. helix-turn-helix, leucine-zipper, alpha-helix hairpin)
        3. tertiary structure - three dimensional structure of the protein (e.g. globular shape of myoglobin)
        4. quaternary structure - interactions between multiple polypeptide chains (subunits) in the protein (e.g. 4 subunits of hemoglobin)

---
5.9 Forces that determine the native (functional) state of the protein:

1. Hydrogen bonding force
H-bonds form between NH and CO groups of the backbone, and between side-chains and the solvent. In each case an H atom is shared between two electronegative atoms.

2. Hydrophobic force
The hydrophobicity of non-polar side-chains (e.g. Leucine, Isoleucine, and Valine) drives them away from the polar solvent into the interior of the protein. Polar and charged side-chains (e.g. Asparagine, Aspartic acid, and Arginine), on the other hand, form hydrogen bonds with the polar solvent and are therefore usually found on the surface of the protein.

3. Electrostatic force
There are three types of interactions: charge-charge, charge-dipole, and dipole-dipole. Charge-charge interactions occur between charged side-groups, for example Aspartic acid (-) and Arginine (+). Charged amino acids form charge-dipole interactions with water (the solvent), so they are mostly found on the exterior surface of proteins.
         
4. Van der Waals force
There are both attractive and repulsive van der Waals forces. Repulsion is the result of electron-electron repulsion when atoms come too close. Attraction involves interaction between induced dipoles. Although van der Waals interactions are individually weak relative to other forces, a large number of them occur in proteins.

5. Disulfide bridges
Disulfide bridges are covalent S-S bonds that can form between the side-chains of two cysteine residues.

---
5.10 Computational problems related to protein structure:

1. Secondary structure prediction
2. Structural motif recognition
3. Tertiary structure prediction (protein folding problem)
4. Inverse protein folding or protein design
5. Docking problem (relates to rational drug design)

---
5.11 Protein Folding problem

Mystery of protein folding - Levinthal paradox:

How can a protein find its native state in a time far shorter than geological time scales, given the astronomical number of possible conformations?

Thermodynamic hypothesis:
Native state of the protein is the state with the lowest Gibbs free energy.

Problem: Given an amino acid sequence S = s1,s2,s3 ... sN, 
	find conformation c' 
	(c' belongs to C, the set of all possible conformations) 
	such that Energy(c') = min{Energy(c) | c in C}.

It was shown that this problem is NP-hard even for very simple lattice models.

Recall that the only two degrees of freedom that we have for each amino acid 
are psi and phi dihedral angles. Even if we consider only 3 states 
for each angle, 9 states for the pair, that would yield 9^n possible states 
for a protein chain of length n.
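
To get a feel for the size of this search space, the sketch below counts 
the 9^n states and estimates how long an exhaustive scan would take. 
The sampling rate of 10^13 conformations per second is an assumed, 
deliberately generous figure for illustration; it is not from the lecture.

```python
# Rough Levinthal-style estimate: 9 (phi, psi) states per residue.
STATES_PER_RESIDUE = 9          # 3 states for phi times 3 states for psi
SAMPLES_PER_SECOND = 1e13       # assumption: illustrative sampling rate
SECONDS_PER_YEAR = 3600 * 24 * 365


def conformations(n):
    """Number of backbone states for a chain of n residues."""
    return STATES_PER_RESIDUE ** n


def years_to_enumerate(n, rate=SAMPLES_PER_SECOND):
    """Years needed to visit every state once at the given rate."""
    return conformations(n) / rate / SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, conformations(n), f"{years_to_enumerate(n):.3g} years")
```

Even under this assumed rate, a chain of 100 residues would need vastly 
more time than the age of the Earth, which is the point of the paradox.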


---
5.12 Protein folding approaches:

1. homology modelling
2. sequence - structure threading (or fold recognition)
3. ab initio prediction 


Homology Modelling:

Find a protein in the PDB with a known structure whose sequence identity 
to the target is usually above 25-30%. This works because closely related 
proteins have very similar folds. 

Drawback: for every unknown sequence there has to be a known homologue in 
the database.


Threading:

When sequence similarity is weaker (less than 30% identity), but a distant 
homologue can still be found, use the structure of the known homologue as 
a seed (starting structure) for further refinement. 
Requires: a good alignment between the sequence and the structure from the database. 
There are dynamic programming algorithms for threading 
(Bowie et al 1991, Lathrop et al 1996). 

Problem: the prediction is only as good as the alignment between the sequence 
and the known structure.


Ab initio prediction:

When there is no known homology, use methods based on physical and 
energetic principles to search through the conformational space.
The models used are usually simplified: lattice models, reduced off-lattice models; 
the energy potentials used are also simplified. Search methods that are often used 
are Monte Carlo and Genetic Algorithms.

A special place in the description of methods belongs to "novel fold recognition".
Methods of this kind that participate in CASP (Critical Assessment of Structure 
Prediction) are not pure ab initio methods but use sequence homology in some way: 
secondary structure predicted using database-derived potentials, fragments 
from existing protein structures, and multiple sequence alignments.

Currently the most successful methods for protein structure prediction are 
homology-based comparative modelling and threading. Recently, however, some novel 
fold recognition methods have outperformed threading methods in CASP on some targets.

---
5.13 An example of a simplified model for protein structure prediction -

Hydrophobic Polar (HP) Lattice model:

The amino acid sequence of a protein is represented by a two-letter alphabet: 
H - amino acids that are hydrophobic and 
P - amino acids that are polar [proposed by Dill 1985]. 

Residues are reduced to a single point on a lattice 
(2D - square lattice, 3D - cubic lattice). 
For most globular proteins (enzymes), the hydrophobic force is the primary force 
that determines structure.

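The energy of a conformation is defined via the number of topological contacts 
between hydrophobic amino acids that are not neighbours in the sequence: 
each such H-H contact (two H residues on adjacent lattice sites) lowers the 
energy by one unit, so the lowest-energy conformation maximises the number 
of H-H contacts. 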
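
The contact count can be made concrete with a small sketch for the 2D 
square lattice. The function name hp_energy, the coordinate representation, 
and the sign convention (each H-H contact counts as -1) are choices made 
here for illustration.

```python
# Energy of a 2D square-lattice HP conformation.
# Convention assumed here: each H-H topological contact contributes -1.

def hp_energy(sequence, coords):
    """sequence: string over 'H'/'P'; coords: list of (x, y) lattice
    points, one per residue, in chain order."""
    if len(sequence) != len(coords):
        raise ValueError("one coordinate per residue is required")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbours")

    index_at = {pos: i for i, pos in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts


if __name__ == "__main__":
    # A 2x2 square: residues 0 and 3 are adjacent on the lattice.
    print(hp_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]))  # -1
```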

Among the best-known algorithms for the 2D and 3D HP models are various 
Monte Carlo algorithms, genetic algorithms, and Ant Colony Optimisation.
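
For comparison with these heuristics, very short chains can be solved by 
exhaustive enumeration of self-avoiding walks. The sketch below is such 
a brute-force baseline for the 2D model (it is not one of the algorithms 
named above). Its running time grows exponentially with chain length, and 
the names best_fold and count_new_contacts are illustrative.

```python
# Exhaustive search for a minimum-energy 2D HP fold (tiny chains only).
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def count_new_contacts(sequence, coords, occupied, pos):
    """H-H contacts created by placing residue len(coords) at pos."""
    i = len(coords)
    if sequence[i] != "H":
        return 0
    total = 0
    for dx, dy in MOVES:
        j = occupied.get((pos[0] + dx, pos[1] + dy))
        # j != i - 1 excludes the neighbour along the chain.
        if j is not None and j != i - 1 and sequence[j] == "H":
            total += 1
    return total


def best_fold(sequence):
    """Return (energy, coords) of one optimal self-avoiding walk."""
    if not sequence:
        return 0, []
    best = [1, None]  # energy is always <= 0, so 1 means "nothing found yet"

    def extend(coords, occupied, contacts):
        if len(coords) == len(sequence):
            if -contacts < best[0]:
                best[0], best[1] = -contacts, list(coords)
            return
        x, y = coords[-1]
        for dx, dy in MOVES:
            pos = (x + dx, y + dy)
            if pos in occupied:
                continue
            gained = count_new_contacts(sequence, coords, occupied, pos)
            occupied[pos] = len(coords)
            coords.append(pos)
            extend(coords, occupied, contacts + gained)
            coords.pop()
            del occupied[pos]

    extend([(0, 0)], {(0, 0): 0}, 0)
    return best[0], best[1]


if __name__ == "__main__":
    energy, fold = best_fold("HPPHPPH")
    print(energy, fold)
```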


---
5.14 An example of a novel fold recognition method that performs 
	very well in CASP: ROSETTA [Baker et al. 1996]
	
ROSETTA:
Structures are represented using a simplified model consisting of the heavy atoms 
of the main chain [N, Calpha, C, O] and the Cbeta atom of the side chain. 
The energy potential used is database-derived plus empirically based.
The three-dimensional structure is generated by splicing together fragments 
(3 and 9 residues long) from a database of known proteins. 

The scoring function:

We seek the most probable structure for a protein given the amino acid sequence 
and the large number of examples of sequences with known structures 
in the protein database.

Using Bayes' theorem, the probability of a structure given the sequence is:

P(structure|sequence) = P(structure)*(P(sequence|structure)/P(sequence))

Since we are comparing different structures for the same sequence, P(sequence) 
is a constant and can be neglected. Not all generated structures are likely to 
be proteins (for example, highly expanded conformations are not), so P(structure) 
is chosen to penalise them: 

P(structure) = 0 if the configuration contains overlap between atoms, and 
P(structure) = exp(-radius of gyration^2) for all other configurations.
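
A minimal sketch of this prior, assuming one 3D point per residue. 
The overlap cutoff min_dist (and its default of 3.0, in the same length 
units as the coordinates) is an assumption made here; the notes do not 
specify how overlap is detected.

```python
import math


def radius_of_gyration_sq(coords):
    """Mean squared distance of the points from their centroid."""
    n = len(coords)
    centroid = [sum(p[k] for p in coords) / n for k in range(3)]
    return sum(
        sum((p[k] - centroid[k]) ** 2 for k in range(3)) for p in coords
    ) / n


def has_overlap(coords, min_dist):
    """True if any two points are closer than min_dist."""
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < min_dist:
                return True
    return False


def structure_prior(coords, min_dist=3.0):
    """Unnormalised P(structure); min_dist is an assumed overlap cutoff."""
    if has_overlap(coords, min_dist):
        return 0.0
    return math.exp(-radius_of_gyration_sq(coords))
```

Compact conformations have a small radius of gyration and therefore 
a larger prior than expanded ones, as intended.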

To evaluate P(sequence|structure) we assume independence of pairs 
of positions (rather than individual positions):

P(sequence|structure) = PRODUCT P(aa_i, aa_j|r_ij) for all i<j, 
	where r_ij is the distance between residues i and j. 

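Using Bayes' theorem again for a particular pair of residues i and j:

P(aa_i, aa_j|r_ij) = P(aa_i, aa_j)*(P(r_ij|aa_i, aa_j)/P(r_ij))

For a fixed sequence the factors P(aa_i, aa_j) do not depend on the structure 
and can be dropped along with P(sequence). Thus, up to a constant factor: 
	P(structure|sequence) = exp(-radius of gyration^2) * PRODUCT [P(r_ij|aa_i, aa_j)/P(r_ij)] for i<j    [Equation *]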
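
Taking the negative logarithm turns this product into a sum that can be 
minimised like an energy. The sketch below shows that form. Treating 
-log of the probability as the "Energy" is an assumption made here, and 
the lookup ratio(aa_i, aa_j, r) with its toy numbers is hypothetical; 
in practice such ratios would be estimated from database statistics.

```python
import math


def pair_energy(sequence, coords, ratio):
    """-log of PRODUCT over i<j of P(r_ij|aa_i, aa_j)/P(r_ij).

    ratio(aa_i, aa_j, r) is a caller-supplied (hypothetical) lookup
    that returns P(r|aa_i, aa_j)/P(r)."""
    total = 0.0
    for i in range(len(sequence)):
        for j in range(i + 1, len(sequence)):
            r = math.dist(coords[i], coords[j])
            total -= math.log(ratio(sequence[i], sequence[j], r))
    return total


def energy(sequence, coords, ratio, rg_sq):
    """-log of the right-hand side: rg_sq plus the pair term."""
    return rg_sq + pair_energy(sequence, coords, ratio)


def toy_ratio(a, b, r):
    # Made-up numbers for illustration only: two leucines closer
    # than 6 length units are treated as twice as likely as average.
    if a == "L" and b == "L" and r < 6.0:
        return 2.0
    return 1.0


if __name__ == "__main__":
    coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
    print(energy("LL", coords, toy_ratio, rg_sq=6.25))
```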

Further expansions, not presented here, lead to the consideration of 
a variety of features of the local structural environment around residue i.  

The prediction procedure:

1. The original sequence is divided into fragments, and each is matched up 
	with structural fragments from the database of related sequences 
	(relatedness is determined by nearest neighbour clustering; 
	homologous sequences are removed beforehand).

2. The resulting conformation is evaluated using the scoring function (Equation *).

3. Monte Carlo simulated annealing is used to optimize the conformation further:

	- perturb the conformation by replacing the torsion angles of a segment 
		of the chain with the torsion angles of a different protein fragment 
		with a related amino acid sequence;
	- evaluate the resulting conformation using the extended form of [Equation *] 
		and accept or reject it based on the Metropolis criterion:
		if Energy_new <= Energy_cur, always accept;
		if Energy_new > Energy_cur, accept 
		with probability exp(-(Energy_new-Energy_cur)/kT), 
		where k is the Boltzmann constant;
	- reduce the temperature T;
	- repeat for a number of iterations.
     
4. Repeat steps 1-3 to obtain many independent simulations.
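
The accept/reject rule and the cooling loop of step 3 can be sketched 
generically as follows. This is not ROSETTA's implementation: the fragment 
replacement move is stood in for by a caller-supplied perturb function, and 
the geometric cooling schedule and default parameters are assumptions.

```python
import math
import random


def metropolis_accept(e_cur, e_new, kT, rng):
    """Always accept downhill moves; accept uphill ones with
    probability exp(-(e_new - e_cur) / kT)."""
    if e_new <= e_cur:
        return True
    return rng.random() < math.exp(-(e_new - e_cur) / kT)


def anneal(state, energy, perturb, kT=1.0, cooling=0.95, steps=2000, seed=0):
    """Generic simulated annealing loop.

    energy(state) -> float and perturb(state, rng) -> new state are
    supplied by the caller. Returns the best state seen and its energy."""
    rng = random.Random(seed)
    e_cur = energy(state)
    best, e_best = state, e_cur
    for _ in range(steps):
        candidate = perturb(state, rng)
        e_new = energy(candidate)
        if metropolis_accept(e_cur, e_new, kT, rng):
            state, e_cur = candidate, e_new
            if e_cur < e_best:
                best, e_best = state, e_cur
        kT = max(kT * cooling, 1e-6)  # reduce temperature, keep kT > 0
    return best, e_best


if __name__ == "__main__":
    # Toy problem: minimise (x - 3)^2 over the integers with +/-1 moves.
    result = anneal(
        10,
        energy=lambda x: (x - 3) ** 2,
        perturb=lambda x, rng: x + rng.choice((-1, 1)),
    )
    print(result)
```

At high temperature uphill moves are accepted often, which lets the search 
escape local minima; as T falls the rule approaches pure descent.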

The structures that result from these simulations are clustered, 
and the centers of the largest clusters are presented as predictions 
of the target structure. The idea is that a structure that emerges 
many times from independent simulations is likely to have favourable features.
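
One simple way to pick such a center, given a matrix of pairwise 
distances between the final structures, is to choose the structure with 
the most neighbours within a cutoff. This is only an illustrative sketch: 
the notes do not say which clustering procedure or distance measure is 
used, and the function name and cutoff are assumptions.

```python
def largest_cluster_center(dist, cutoff):
    """dist: symmetric matrix (list of lists) of pairwise distances
    between structures. Returns (index of center, member indices)
    for the structure with the most neighbours within cutoff."""
    n = len(dist)
    best_center, best_members = None, []
    for i in range(n):
        members = [j for j in range(n) if dist[i][j] <= cutoff]
        if len(members) > len(best_members):
            best_center, best_members = i, members
    return best_center, best_members


if __name__ == "__main__":
    # Structures 0-2 are mutually similar; structure 3 is an outlier.
    dist = [
        [0, 1, 1, 9],
        [1, 0, 1, 9],
        [1, 1, 0, 9],
        [9, 9, 9, 0],
    ]
    print(largest_cluster_center(dist, cutoff=2))  # (0, [0, 1, 2])
```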


---
Important concepts include:
- basic protein chemistry
- 4 levels of protein structure
- forces that determine protein structure
- structure prediction approaches (similarities and differences among them)
   1. homology modelling
   2. threading
   3. ab initio structure prediction
- Thermodynamic hypothesis
- HP model
- basic approach used in ROSETTA - Monte Carlo method
 

---
Resources:

[HP] Lyngsø and Pedersen, paper available online at: http://citeseer.nj.nec.com/384609.html  

[ROSETTA] Simons, Kooperberg, Huang and Baker, Journal of Molecular Biology (1997) 268, 209-225.